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A common belief in high-dimensional data analysis is that data are concentrated on a low- 
dimensional manifold. This motivates simultaneous dimension reduction and regression on man- 
ifolds. We provide an algorithm for learning gradients on manifolds for dimension reduction for 
high-dimensional data with few observations. We obtain generalization error bounds for the 
gradient estimates and show that the convergence rate depends on the intrinsic dimension of 
the manifold and not on the dimension of the ambient space. We illustrate the efficacy of this 
approach empirically on simulated and real data and compare the method to other dimension 
reduction procedures. 
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1. Introduction 

The inference problems associated with high-dimensional data offer fundamental chal- 
lenges - the scientifically central questions of model and variable selection - that lie at 
the heart of modern statistics and machine learning. A promising paradigm in addressing 
these challenges is the observation or belief that high-dimensional data arising from phys- 
ical or biological systems can be effectively modeled or analyzed as being concentrated on 
a low-dimensional manifold. In this paper we consider the problem of dimension reduction 
- finding linear combinations of salient variables and estimating how they covary - based 
upon the manifold paradigm. We are particularly interested in the high-dimensional data 
setting, where the number of variables is much greater than the number of observations, 
sometimes called the "large p, small n" paradigm [22]. 

The idea of reducing high-dimensional data to a few relevant dimensions has been 
extensively explored in statistics, computer science and various natural and social sci- 
ences. In machine learning the ideas in isometric feature mapping (ISOMAP) [20], local 
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linear embedding (LLE) [18], Hessian eigenmaps [8] and Laplacian eigenmaps [2] are 
all formulated from the manifold paradigm. However, these approaches do not use re- 
sponse variates in the models or algorithms and hence may be suboptimal with respect 
to predicting response. In statistics the ideas developed in sliced inverse regression (SIR) 
[13], (conditional) minimum average variance estimation (MAVE) [23] and sliced aver- 
age variance estimation (SAVE) [6] consider dimension reduction for regression prob- 
lems. The response variates are taken into account and the focus is on predictive linear 
subspaces called either effective dimension reduction (e.d.r.) space [23] or central mean 
subspace [5]. These approaches do not extend to the manifold paradigm. In [15, 16] 
the method of learning gradients was introduced for variable selection and regression in 
high-dimensional analysis for regression and binary regression setting. This method, in 
machine learning terminology, is in spirit a supervised version of Hessian eigenmaps and, 
in statistics terminology, can be regarded as a nonparametric extension of MAVE. This 
approach can be extended to the general manifold setting. The main purpose of this 
paper is to explore this idea. 

The inference problem in regression is estimating the functional dependence between 
a response variable Y and a vector of explanatory variables X 

Y = f(X) + e 

from a set of observations D = { (xi , y;)}?=i where X E X C M. p has dimension p and Y E R 
is a real valued response for regression and Y £ {±1} for binary regression. Typically the 
data are drawn i.i.d. from a joint distribution, (xi,yi) ~ p(X,Y). We may in addition 
want to know which variables of X are most relevant in making this prediction. This 
can be achieved via a variety of methods [4, 12, 21]. Unfortunately, these methods and 
most others do not provide estimates of covariance for salient explanatory variables and 
cannot provide the e.d.r. space or central mean subspace. Approaches such as SIR [13] 
and MAVE [23] address this shortcoming. 

SIR and its generalized versions have been successful in a variety of dimension reduction 
applications and provide almost perfect estimates of the e.d.r. spaces once the design 
conditions are fulfilled. However, the design conditions are limited and the method fails 
when the model assumptions are violated. For example, quadratic functions or between 
group variances near zero violate the model assumptions. In addition, since SIR finds 
only one direction, its applicability to binary regression is limited. 

MAVE and the outer product of gradient (OPG) method [23] are based on estimates of 
the gradient outer product matrix either implicitly or explicitly. They estimate the central 
mean subspace under weak design conditions and can capture all predictive directions. 
However, they cannot be directly used for "large p, small n" setting due to overfitting. 
The learning gradient method in [15, 16] estimates the gradient of the target function by 
nonparametric kernel models. It can also be used to compute the gradient outer product 
matrix and realize the estimation of the central mean subspace by the same manner as 
OPG (see Section 4 below). Moreover, this method can be directly used for the "large p, 
small n" setting because the regularization technique prevents overfitting and guarantees 
stability. 
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All the above methods have been shown to be successful by simulations and appli- 
cations. However we would like a theoretical and conceptual explanation of why this 
approach to dimension reduction is successful with very few samples and many dimen- 
sions. Conceptually this reduces to the following analysis: For a target function on a 
nonlinear manifold, the gradient outer product matrix defined in the Euclidean can still 
be used to estimate predictive directions even when the gradient is not well defined on the 
ambient space. Theoretically, we notice that the consistency results for MAVE and OPG 
[23] and learning gradients [15, 16] provide asymptotic rates of order 0(?i~ 1 / p ). Clearly 
this is not satisfactory and does not support practical applicability when p is large, es- 
pecially for the setting where p^> n. Intuitively one should expect that the rate would 
be a function not of the dimension of the ambient space but of the intrinsic dimension 
of the manifold. 

In this paper we extend the learning gradient algorithms from the Euclidean setting to 
the manifold setting to address these questions. Our two main contributions address the 
conceptual and theoretical issues above. From a conceptual perspective we will design 
algorithms for learning the gradient along the manifold. The algorithm in the Euclidean 
setting can be applied without any modifications to the manifold setting. However, the 
interpretation of the estimator is very different and the solutions contain information 
on the gradient of the target function along the manifold. This interpretation provides 
a conceptual basis for using the usual p-dimensional gradient outer product matrix for 
dimension reduction. From a theoretical perspective, we show that the asymptotic rate 
of convergence of the gradient estimates is of order 0(n _1 / d ) with d being the intrinsic 
dimension of the manifold. This suggests why in practice these methods perform quite 
well, since d may be much smaller than n though p> n. 

The paper will be arranged as follows. In Section 2 we develop the learning gradient 
algorithms on manifolds. The asymptotic convergence is discussed in Section 3, where we 
show that the rate of convergence of the gradient estimate is of order 0(?i~ 1 / d ). In Section 
4 we explain why dimension reduction via gradient estimates has a solid conceptual basis 
in the manifold setting and discuss relations and comparisons to existing work. Simulated 
and real data are used in Section 5 to verify our claims empirically and closing remarks 
and comments are given in Section 6. 

2. Learning gradients 

In this section we first review the gradient estimation method on Euclidean spaces pro- 
posed in [15, 16]. Then after a short discussion of Taylor series expansion on manifolds 
we formulate learning gradients under the manifold setting. 

2.1. Learning gradients in Euclidean space 

In the standard regression problem the target is the regression function defined by the 
conditional mean of Y\X, that is, f r = ¥,y[Y\X]. The objective of learning gradients is 
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to estimate the gradient 



Vf= (9fr dUV 



where x = (x , . . . ,x p ) <E M. p . The learning gradient algorithm developed in [16] is moti- 
vated by fitting first-order differences. 

Recall f r is the minimizer of the variance or mean square error functional, 

f r (x) = E(Y\X = x) = argmin Var(/) where Var(/) = E(Y - f(X)f. 

When only a set of samples D are available the functional is usually approximated em- 
pirically 

VM(/)«±f>-/0ri)) 2 . 
i=l 

Using the first-order Taylor series expansion approximating a smooth function / by 

f(x) ps f(u) + V/(x) -(x-u) for x ps u, 
the variance of / may be approximated as 

1 " 

where my is a weight function that ensures the locality of Xi ps and thus my — > 
as 1 1 x.; — xj | — > oo . The weight function my is typically characterized by a bandwidth 
parameter, for example, a Gaussian with the bandwidth as the standard deviation my = 

e -||x 1 -x J || 2 /(2 S 2 )_ 

Learning gradient algorithms were specifically designed for very high-dimensional data 
but with limited number of observations. For regression the algorithm was introduced in 
[16] by nonparametric reproducing kernel Hilbert space (RKHS) models. The estimate 
of the gradient is given by minimizing (2.1) with regularization in an RKHS 



fo := arg min 

f&e K 



Wi ^yj - Vi - f( x ^ • ( x j - - x *)) 2 + x \\f\\ 2 K 



(2.2) 



where T-Lk = 7-Lk{X) is a reproducing kernel Hilbert space (RKHS) on X associated with 
a Mercer kernel K (for the definition and properties of RKHS, see [1]) and % P K is the 

space of p functions /=(/],,..., f p ) where fa e U K , \\f\\ 2 K = Yn=\ WMk and A > 0. 

With the weight function chosen to be the Gaussian with standard variance s 2 , a finite 
sample probabilistic bound for the distance between /d and V/ r is provided in [16], 
which implies the convergence of the gradient estimate to the true gradient, fo —> V/ r , 
at a slow rate, 0(?i _1 / p ). 
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For binary classification problems where Y = {±1}, the learning gradient algorithm 
was introduced in [15]. The idea is to use the fact that the function 



f c {x) :=log 



Prob(y = l\x) 



Prob(j/ = -l| 



= log 



p(y = l\x) 



p(y = -l\x) 



is given by 



/ c = argminE^(F/(X)) 



with <p(t) = log(l +e _t ). Notice that the Bayes optimal classification function is given 
by sgn(/ c ), the sign of f c . In the binary classification setting we learn the gradient of f c . 
Applying the first-order Taylor expansion of / we have for the given data D 

E<t>(Yf(X)) « ± E ViMViifixj) + V/(x,) • ( Xi - Xj ))). 

Modeling / and V/ by a real valued function g and a vector valued function /, respec- 
tively, leads to the empirical risk 

1 ™ 

£<t>,D{g,f) = -2 E w a < t ) (yi(9(xj) + f(xi)- (xi-xj))). 

Minimizing this empirical risk with a regularization term gives the following algorithm 
G?^,/^)=arg min (S^ D {g, f) + Xi\\g\\ 2 K + X 2 \\f\\ 2 K ), (2.3) 

where Ai,A2 are the regularization parameters. A finite sample probabilistic bound for 
the distance from g$ t D to f c and f^.D to V/ c is provided in [15], which leads to a very 
slow rate of order, 0(n~ x / p ). 



2.2. Gradients and Taylor expansions on Riemannian manifolds 

In order to extend learning gradients to the manifold setting, it is necessary to formulate 
gradients and first-order Taylor expansions on manifolds. To do this we need to intro- 
duce some concepts and notation from Riemannian geometry. We introduce only what is 
needed so that we can stress concepts over technical details. For a complete and rigorous 
formulation, see [7]. 

The two key concepts are vector fields and the exponential map. Let M. be a d- 
dimensional smooth (i.e., C°°) Riemannian manifold and d J v((a,6) be the Riemannian 
distance on A4 between two points a,b € M. The tangent space at a point q £ M. is a 
d-dimensional linear space and will be denoted by T q M.. There exists an inner product 
on this tangent space (•, -) q that defines the Riemannian structure on A4. 
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A vector field on a manifold is an assignment to every point q on the manifold tangent 
vector in T q M. The gradient of a smooth function / on X, Vjvi/, is a vector field 
satisfying 

(Vju/fa), «) 9 = v(f) for all v e T q M , q g M 
It can be represented using an orthonormal basis {e^, . . . , e^} of T q M. as the vector 

V M /(<7) = (ei/,...,4/). 

If the manifold is the Euclidean space (AJ = R d ), then one may take e\ = and the 
above definition reduces to the standard definition of gradients. 

The exponential map at a point q, denoted by exp 9 , is a map from the tangent space 
T q A4 to the manifold A4. It is defined by the the locally length- minimizing curve, the 
so-called geodesic. The image of v £ T q A4 is the end of a geodesic starting at q with 
velocity v and time 1. In general the exponential map is only locally defined in that 
it maps a small neighborhood of the origin in T q M. to a neighborhood of q on the 
manifold. Its inverse, exp^ 1 , maps the point exp q (v) to the vector (v 1 , . . . , v d ) € R d where 

v = J2i=i v%e i - This provides a local chart for the neighborhood of q and {ef } are called 
the g-normal coordinates of this neighborhood. 

Under the q-normal coordinates the gradient vector field Vjm/ takes the form 
Vtw/(<?) = V(/ o exp g )(0). Note that / o exp 9 is a smooth function on T q M = M d . The 
following first-order Taylor series expansion holds: 

(foexp g )(v) w (/ocxp 9 )(0) + V(/oexp g )(0) ■ v for ti«0. 

This gives us the following Taylor expansion of / around a point q € M. : 

/(oq> g (v))«/(ff) + <VA4/(?),«>, foTV€T q M,v^0. (2.4) 

The above expansion does not depend on the choice of the coordinate system at q. 

2.3. Learning gradients on Riemannian manifolds 

In the manifold setting, the explanatory variables are assumed to concentrate on an un- 
known d-dimensional Riemannian manifold M and there exists an isometric embedding 
ip: M — > R p with every point on Ai described by a vector in R p . When a set of points 
are drawn from the marginal distribution p M concentrated on M. what we know are not 
the points {<Zi}" =1 € M. themselves but their images under ip:Xi = ip{qi). 

To formulate the learning gradient algorithm for the regression setting, we apply the 
first-order Taylor expansion (2.4). The empirical approximation of the variance by the 
data {{quVi)} is 

1 " 

Var(/) w — 22 w iAVi -Vi~ (^Mf(q t ),v l3 ) q f, (2.5) 
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where £ T ?i 7W is the tangent vector such that qj = exp g . (vij ) . One may immediately 
notice the difficulty that Vij is not computable without knowing the manifold. A natural 
idea to overcome this difficulty is to represent all quantities in W. This is also compatible 
with the fact that we are given images of the points Xi = <p{qi) £ K p rather than the d- 
dimensional representation on the manifold. 

Suppose x = ip(q) and £ = ip(exp q (v)) for q£ M and v £ T q Ai. Since ip is an isometric 
embedding, i.e., dip q :T q Ai — > T X R P = M. p is an isometry for every q £ Ai, the following 
holds: 

where 

dtp q [v) sa (p(acp q (v)) — ip(q) — £ — x for v ~ 0. 
Applying these relations to the observations D = {(a^, 2/i)}"=i yields 

(VMf(qi),Vij) qi ps d^g^VTu/fe)) • (xj-Xj). (2.6) 

Notice further that d</? o V^/ is a p-dimcnsional vector valued function on ^(A^f) c K p 
defined by (dip o V M f)((p(q)) = dp q (\7 mKq)) for q £ M. Applying (2.6) to (2.5) and 
denoting dip o V.m/' by / yields 

1 

Var(/) ~£ D {f) = — w iAVj - Vi - f( x i) 1 ( x j ~ x *)f ■ 

n 1,3 = 1 

Minimizing this quantity leads to the following learning gradient algorithm on manifolds: 

Definition 2. 1 . Let M. be a Riemannian manifold and p : M — > W be an isometric 
embedding that is unknown. Denote X = p(AA) and T-Lk = Hk(X). For the sample D = 
{(qi, Ui)}f = i £ (Ai x R)", Xi — p>(qi) £ W, the learning gradient algorithm on A4 is 

f D := arg min {£ D (f) + \\\f\\ 2 K }, 

where the weights Wij , the RKHS, T~iF K , the RKHS norm \\f\\K o-nd the parameter A > 
are the same as in (2.2). 

Similarly we can deduce the learning gradient algorithm for binary classification on 
manifolds. 

Definition 2.2. Let A\ be a Riemannian manifold and p:A4 — >W be isometry em- 
bedding. Denote X = ip(Ai) and Hk = Hk(X)- For the sample D = {(<Zi, 2/i)}7=i G 
(Ai x y) n , Xi = p>(qi) £ W , the weighted empirical risk for g : X — > K and f : X — > W 
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is defined as 

n 

£<P,D{gJ) = W ij^(Vi(9(Xj) + f{Xi) ■ (Xi - Xjj)) 

and the algorithm for learning gradient on the manifold is 

G?^,Az?)=arg min (S <l> , D (g,f) + X 1 \\g\\ 2 K + X 2 \\f\\ 2 K ), (2.7) 

where Ai,A2 are the regularization parameters. 

Surprisingly these algorithms have forms that are identical to the learning gradient 
algorithms in Euclidean space. However, the geometric interpretation is different. Note 
that / in Definition 2.1 (or 2.2) models d<^(V M.fr) (or dip(V a-i/c)), not the gradient 
itself. 

3. Convergence rates as a function of the intrinsic 
dimension 

Given the interpretations of the gradient estimates developed in the previous section, it 
is natural to seek conditions and rates for the convergence of fo to d(/?(Vx/ r ). Since 
I = (dip)* (dip) , where I is the identity operator, the convergence to the gradient on the 
manifold is given by (d(p)*(fo) — > V ' jvifr- The aim of this section is to show that this 
convergence is true under mild conditions and provide rates. 

Throughout this paper we use the following exponential weight function with scale 
parameter s 2 , 

The following if -functional will enter our estimates 

JT(t)=w£ (||(d¥>)*/-vWr|li a +t\\ff K ). 
feH" K "m 

The following theorem provides upper bounds for the gradient estimate as a function 
of the if-functional. 

Theorem 3.1. Let M. be a compact Riemannian manifold with metric dj^t and let d/u be 

the uniform measure on M. Assume the marginal distribution p M satisfies the regularity 
conditions: 

(i) The density v(x) = Ap ^^ exists and for some c\ > and < 6 < 1 

\v(x) - v(u)\ < c 1 d e M (x,u) Vx,ueM. (3.1) 
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(ii) The measure along the boundary is small. There exists C2 > such that 

p M ({x£M:d M (x,dM) <*}) <c 2 t Vt > 0. (3.2) 

Suppose f r £ C 2 (A4 ) . There exists < £0 < 1 , which depends on M. only, and a constant 
C p > such that, if s < £q and A = s d + 2 + 26 , fften wit/i probability 1 — 5 (6 £ (0,1)) the 
following bound holds: 

II (d„) V, - v^Ml^ < c p (log |) a (^ + - 9 + + i) ^(2,-)) . 

The rate of convergence of the gradient estimate is an immediate corollary. 

Corollary 3.1. Under the assumptions of Theorem 3.1, if ,W(t) = 0(tP) for some ^ < 
fi < 1, then there exist sequences A = X(n) and s = s(n) such that 

\\(dtp)*f D - Vx/r|||a as 00. 

If S = r j- 1 / 2d + 4 + 5e ar j^ ,\ = ^ e ra £ e y conver g ence j s 

IKd^r/D-VWrlli, = (n- 9 ^- 1 )/( M+4+5e )). 

This result states that the convergence rate of learning gradient algorithms depends 
on the intrinsic dimension d of the manifold, not the dimension p of the ambient space. 
Under the belief that high-dimensional data have low intrinsic dimension d <C p, this 
explains why the learning gradient algorithms arc still efficient for high-dimensional data 
analysis even when there are limited observations. 

If Ai is a compact domain in MP where d = p, dip — (d(p)* = I and Vm is the usual 
gradient operator, our result reduces to the Euclidean setting that is proven in [16]. 

The upper bound in Theorem 3.1 is not as tight as possible and may not lead to con- 
vergence even when the gradient estimate converges. This is illustrated by the following 
case. We expect that if Hk is dense in C(M) then Jf(t) — > as t — > and the gradi- 
ent estimate should converge in probability to the true gradient. However, Corollary 3.1 
states that the convergence holds only when J(f(t) decays faster than Oft 1 / 2 ). This is a 
result of the proof technique we use. More sophisticated but less general proof techniques 
can give us better estimates and close the above gap; see Remark B.l in Appendix B for 
details. 

In case of a uniform distribution measure on the manifold, dp M = d^, we have the 
following improved upper bound that closes the gap. 

Theorem 3.2. Let Ai be compact, dp M = dfi and f r £ C 2 (M.). There exists <£q<1, 
which depends on A4 only, and a constant C M > such that, if s < Eq and A = s d+3 , then 
with probability 1 — 6 (5 £ (0, 1) ) the following bound holds: 

\\(d<p)*? D V M fr\\ 2 Ll < C, (log fj 2 (-L + s + (_L_ + 1) JT(2 S )) . 
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An immediate corollary of this theorem is that the rate of convergence of the gradient 
estimate does not suffer from the same gap as Corollary 3.1. 

Corollary 3.2. Under the assumptions of Theorem 3.2, if Jtf(t) — > as t 0, then 
\\(dtp)*f D — Vjw/rllia ->-0 asn^oo 

if A = s d+3 and s = s(n) is chosen such that 

, Jf(2s 2 ) 
s — > and Trrrz > as n— »oo. 

If in addition J(T(t) = 0(t' 3 ) for some < /3 < 1, i/iera wii/i i/ie choice s = ?7~ 1 /( 2ti +7) anrf 
A = s d+3 , we obtain 

\\(^yf D -V M f r \\l 2 =0(n-P'^). 



Note that dp^ = d/i implies z/ = 1 and hence (3.1) holds with 8 = 1. In this case the 
rate in Corollary 3.1 is 0(?i-( 2 ' 3 - 1 )/( 2d + 9 )). Since ^ > ff^, the rate in Corollary 3.2 
is better. 

The proofs of the above theorems will be given in Appendix B. 

For learning gradients on manifolds for binary classification problems the convergence 
(dip)* f,p t D — * VjV(/c with rate 0(n -1 / d ) can be obtained similarly. We omit the details. 

We close this section with some remarks. Note the operator (d<^)* is the projection 
onto the tangent space. The convergence results assert that the projection of fo asymp- 
totically approximates the gradient on the manifold. It may be more natural to consider 
convergence of fu to the gradient on the manifold (after mapping into the ambient 
space), that is, fu — > d<p(V>i/r)- Unfortunately this is not generally true. 

Convergence of learning algorithms that are adaptive to the manifold setting has been 
considered in the literature [3, 24]. In particular, in [3] local polynomial regression is 
shown to attain minimax optimal rates in estimating the regression function. Whether 
it is plausible to extend this to the gradient learning setting is not known. There are 
a few essential differences between our result and that of Bickel and Li [3]. Unlike our 
result, their result is pointwise in that convergence and error bounds depend on the point 
x G X and it is not obvious how to obtain Li convergence from pointwise convergence. 
In addition, Bickel and Li [3] assume a strong condition on the partial derivatives of the 
regression function in the ambient space. This may be problematic since these partial 
derivatives may not be well defined in the ambient space. Since we have different as- 
sumptions and a different sense of convergence, the minimax optimality of our results 
cannot be obtained directly using their arguments. Moreover, in our setting the optimal 
learning rates will also depend on the choice of the kernel. This is a very interesting open 
problem. 
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4. Dimension reduction via gradient estimates 

The gradient estimates can be used to compute the gradient outer product matrix and 
provide an estimate of the linear predictive directions or the e.d.r. space. 

4.1. Estimate gradient outer product matrix and e.d.r. space 

Let us start from a semi-parametric model: 

y = f r (x)+e = F(B T x)+e, 

where B = (b\, . . . , bk) <G M. pxk contains the k e.d.r. directions. When the input space is a 
domain of R p and f r is smooth, the gradient outer product matrix G with 




dfr dfr\ 



is well defined. It is easy to see that B is given by eigenvectors of G with non-zero 
eigenvalues. 

Using the gradient estimates fo that approximate V/ r , we can compute an empirical 
version of the gradient outer product matrix G by 

1 " 

G = - J2fD(xe)fD(xi) T - 

i=i 

Then the estimates of e.d.r. space can be given by a spectral decomposition of G. 

When the input space is a manifold, since the manifold is not known, we cannot 
really compute Vjn/ r through fo- We can only work directly on fo- So we propose to 
implement the dimension reduction by the same procedure as in the Euclidean setting, 
that is, we first compute G using fu and then estimate the e.d.r. space by the eigenvectors 
of G with top eigenvalues. The only problem is the feasibility of this procedure. The 
following theorem suggests that the irrelevant dimensions can be filtered out by a spectral 
decomposition of G. 

Theorem 4.1. Let Ai > A2 > • ■ ■ > A p be the eigenvalues and £ = 1, . . . ,p be the 

corresponding eigenvectors of G. Then for any £> k we have B T U£ — > 0^ and ujGug — > 
in probability. 

4.2. An alternative for classification 

The idea of dimension reduction by spectral decomposition of the gradient outer produc- 
tion matrix is to estimate the e.d.r. space by finding the directions associated with the 
directional partial derivative of the largest L? norm. Intuitively we think the L 2 norm 
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may have drawbacks for binary classification problems because a large value of the gra- 
dient often is located around the decision boundary, which usually has low density. Thus, 
an important predictive dimension may correspond to a directional partial derivative 
with a small L 2 norm and hence be filtered out. This motivates us to use L°° norm or 
Wk norm instead and provide an alternative method. 

By using the Hk norm we consider the gradient covariance matrix (EGCM) E defined 

by 

tii ■= /iK.-f'n., i< for l<i,j<p. (4.1) 

The eigenvectors for the top eigenvalues will be called empirical sensitive linear features 
(ESF). Estimating the e.d.r. by ESFs is an alternative method for the dimension reduction 
by gradient estimates and will be referred to as gradient-based linear feature construction 
(GLFC). 

Though using the L°° norm or T-Lk norm for classification seems to be more intu- 
itive than using the L 2 norm, empirically we obtain almost identical results using either 
method. 

4.3. Computational considerations 

At a first glance one may think it is problematic to compute the spectral decomposition 
of G or E when p is huge. However, due to the special structure of our gradient estimates, 
they can in fact be realized efficiently in both time, 0(n 2 p + n 3 ), and memory, 0(pn). 
In the following we comment on the computational issues for the EGCM. 

In both the regression [16] and the classification [15] settings the gradient estimates 
satisfy the following rcprcscntcr theorem: 

n 

foix) = y j Cj, D K(x,x i ). 

i=l 

A result of this representer property is that the EGCM has the following positive- 
semidefinite quadratic form 

E = c D KcJ), 

where cd — [ c i,d, ■ ■ ■ , Cu,d) € M px ™ and K is the nxn kernel matrix with Kij = K(xi,Xj). 
Since K is positive definite, there exists a lower-diagonal matrix L so that LL T = K . 
Then 

E = c D c]j, 

where cd = cqL ispxn matrix. An immediate result of this formula is that the EGCM 
is a rank n matrix and has at most n non-zero eigenvalues and therefore at most the 
top n empirical features will be selected as relevant ones. Efficient solvers for the first n 
eigenvectors of E using QR decomposition for cu [10] are standard and require 0(n 2 p) 
time and O(pn) memory. This observation conforms to the intuition that with n samples, 
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it is impossible to select more than n independent features or predict a function that 
depends on more than n independent variables. 



4.4. Relations to OPG method 

MAVE and the OPG method proposed in [23] share similar ideas by using the gradient 
outer product matrix implicitly or explicitly. Of them, OPG is more related to learning 
gradients. We discuss differences between the methods. 

At each point Xj the OPG method estimates the function value a, ~ fripj) an d gra- 
dient vector bj ps Wf r (xj) by 

n 

(a,j,bj) = arg min } w l - a - b T (x i - x,)] 2 . (4.2) 

i=l 

Then the gradient outer product matrix is approximated by 



J'=l 



Notice that if we set the kernel as K (x, u) = S X;U and A = the learning gradient algorithm 
reduces to (4.2). In this sense the learning gradient extends OPG from estimating the 
gradient vector only at the sampling points to estimating the gradients by a vector valued 
function. Hence the estimates extend to out-of-sample points. This offers the potential to 
apply the method more generally; for example, numerical derivatives in a low-dimensional 
space or function adaptive diffusion maps (see [17] for details). 

The solution to the minimization problem in (4.2) is not unique when p > n. This 
can result in overfitting and instability of the estimate of the gradient outer product 
matrix. In this sense OPG cannot be directly used for the "large p, small n" problem. 
The regularization term in the learning gradient algorithms helps to reduce overfitting 
and allows for feasible estimates in the "large p, small n" setting. However, we should 
remark that this is a theoretical and conceptual argument. In practice OPG can be used 
together with preprocessing the data using principal components analysis (PCA) and 
results in performance comparable to learning gradients. 



5. Results on simulated and real data 

In this section we illustrate the utility and properties of our method. We will focus on 
the "large p, small n" setting and binary classification problems. 

In the following simulations we always set the weight function as exp(— ^~ 2 "^ ) with 

s as the median of pairwise distance of the sampling points. The Mercer kernel K is 

ii — ii 2 

K(x,u) = exp(— " ) with a equal to 0.2 times the median of pairwise distance of the 
sample points. They may not be optimal but work well when p 3> n in our experience. 
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(a) Training data 



(b) EGCM 
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Figure 1. Linear classification simulation with a = 0.5. 



5.1. A linear classification simulation 

Data are drawn from two classes in an n = 100 dimensional space. Samples from class 
— 1 were drawn from 

o; J '= 1:10 ~N(1.5,<7), xi =U:20 ~N(-3,<r), ^= 21:100 ~ N(0,a). 

Samples from class +1 were drawn from 

^=41:60 ^ N (_i. 5j(r ) j ^= 51:60 ~N(3,a), ^= 1:40 ' 61:100 ~N(0,cr). 

Note that a measures the noise level and difficulty of extracting the correct dimensions. 
We drew 20 observations for each class from the above distribution as training data and 
another 40 samples are independently drawn as test data. By changing the noise level a 
from 0.2 to 3, we found our method stably finds the correct predictive directions when 
a < 2. From a prediction point of view the result is still acceptable when a > 2 though 
the estimates of the predictive dimension contain somewhat larger errors. In Figures 1 
and 2 we show the results for a = 0.5 and a = 2.5, respectively. 
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(a) Training data 
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(e) Projection of training data 



(e) Projection of test data 



20 40 60 80 



Figure 2. Linear classification simulation with a = 2.5. 



5.2. A nonlinear classification simulation 

Data arc drawn from two classes in a p-dimcnsional space. Only the first d-dimensions are 
relevant in the classification problem. For samples from class +1 the first d-dimensions 
correspond to points drawn uniformly from the surface of a d-dimcnsional hypersphcre 
with radius r, for class —1 the first 10 dimensions correspond to points drawn uniformly 
from the surface of a d-dimensional hypersphere with radius 2.5r. The remaining p-d- 
dimensions are noise 

x J '-N(0,cr) for j = d+l,...,p. 

Note that the data can be separated by a hypersphere in the first ci-dimensions. Therefore 
projecting the data onto the first d-ESFs for this problem should reflect the underlying 
geometry. 

In this simulation we set p = 200, d = 2, r = 3 and a varying from 0.1 to 1. The first 
two ESFs are shown to capture the correct underlying structure when a < 0.7. In Figure 
3 we give the result with a = 0.2. We also studied the influence of p and found when 
p < 50 the noise level can be as large as 1.0. 
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(d) The first 2 ESFs 



******* 



(e) Projection of training data 

++ * 

» + 



(e) Projection of test data 



Figure 3. Nonlinear classification simulation with a = 0.2. In (d) the first ESF is in blue and 
the second in red. 



5.3. Digit classification 



A standard data set used in the machine learning community to benchmark classification 
algorithms is the MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist/). 
The data set contains 60 000 images of handwritten digits {0, 1, 2, . . . , 9}, where each 
image consists of p = 28 x 28 = 784 grayscale pixel intensities. This data set is commonly 
believed to have strong nonlinear manifold structure. In this section we report results on 
one of the most difficult pairwisc comparisons: discriminating a handwritten "3" from an 
"8". 

In the simulation we randomly choose 30 images from each class and the remaining are 
used as the test set. We compare the following dimension reduction methods GLFC, SIR 
and OPG. In Table 1 we report the classification error rates by the k-NN classifier with 
k = 5 using the respective method for dimension reduction. The SIR results reported arc 
for a regularized version of SIR (RSIR) since SIR is not stable for very high-dimensional 
data. As mentioned before OPG cannot be directly applied so we first run PCA. Wc 
compare the results for using all PCs (PC-OPG) and 30 PCs (PC30-OPG), respectively. 
The last column is the error rate by k-NN without dimension reduction. 
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5.4. Gene expression data 

One problem domain where high dimensions are ubiquitous is the analysis and classifi- 
cation of gene expression data. We consider two classification problems based on gene 
expression data. One is the study using expression data to discriminate acute myeloid 
leukemia (AML) from acute lymphoblastic leukemia (ALL) [11] and another is the clas- 
sification of prostate cancer [19]. In the leukemia data set there are 48 samples of AML 
and 25 samples of ALL. The number of genes is p = 7129. The data set was split into 
a training set of 38 samples and a test set of 35 samples as specified in [11]. In the 
prostate cancer data, the dimension is p = 12 600. The training data contains 102 sam- 
ples, 52 tumor samples and 50 non-tumor samples. The independent test data contains 
34 samples from a different experiment. We applied GLFC, SIR and OPG to these two 
data sets and compared the accuracy using a linear support vector machine classifier. 
The leave-one-out (LOO) error over the training data and the test error are reported in 
Tables 2 and 3 for the leukemia data and prostate cancer data, respectively. For leukemia 
classification the two classes are well separated and all methods perform quite similarly. 
For prostate cancer classification, the accuracy in [19] is about 90%. All methods achieve 
similar accuracy on the training data. GLFC and OPG methods have better prediction 
power on test data. From our experiments (both for gene expression data and digits data) 
we see that the PC-OPG method performs quite similarly to GLFC when the number of 
top PCs is correctly set. But it seems quite sensitive to this choice. 

6. Discussion 

In this paper we first extended the gradient estimation and feature selection frame- 
work outlined in [15, 16] from the ambient space setting to the manifold setting. Con- 
vergence is shown to depend on the intrinsic dimension of the manifold but not the 
dimension of the ambient Euclidean space. This helps to explain the feasibility "large 



Table 1. Classification error rate for 3 vs. 8 



GLFC 


PC-OPG 


PC30-OPG 


SIR 


RSIR 


kNN (k 


0.0853 


0.1110 


0.0912 


0.1877 


0.0956 


0.1024 



Table 2. Classification error for leukemia data 





GLFC 


PC-OPG 


SIR 


SVM 


LOO error 


1 


1 


1 


1 


Test error 








1 


1 


Dimension 


d = 2 


d = 4 


d=l 


d = 7129 
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Table 3. Classification error for prostate cancer data 



GLFC PC-OPG PC50-OPG SIR 



SVM 



LOO error 9 
Test error 2 
Dimension d = 5 



15 
9 

d = 2 



10 
6 

d = l 



9 
9 

d=l 



9 
9 

d= 12600 



p, small n" problems. We outlined properties of this method and illustrated its util- 
ity for real and simulated data. Matlab code for learning gradients can be obtained at 
http : //www. stat . duke . edu/~sayan/sof t .html. 

We close by stating open problems and discussion points: 

1. Large p, not so small n: The computational approaches used in this paper require 
that n is small. The theory we provide does not place any constraint on n. The 
extension of the computational methods to larger n involves the ability to expand 
the gradient estimates in terms of an efficient bases expansion. For the approach 
proposed in this paper the number bases are at most n 2 , which is efficient for small 
n. 

2. Fully Bayesian model: The Tikhonov regularization framework coupled with the use 
of an RKHS allows us to implement a fully Bayesian version of the procedure in the 
context of Bayesian radial basis (RB) models [14]. The Bayesian RB framework can 
be extended to develop a proper probability model for the gradient learning problem. 
The optimization procedures in Definitions 2.1 and 2.2 would be replaced by Markov 
chain Monte Carlo methods and the full posterior rather than the maximum a 
posteriori estimate would be computed. A very useful result of this is that in addition 
to the point estimates for the gradient we would also be able to compute confidence 
intervals. 

Appendix A: Geometric background 

In this section, we introduce some properties on manifolds that we will need in our proofs. 
Let M be a Riemannian manifold and tp : M. — > W be an isometric embedding, that is, 
for every q£M, dip q : T q M -> T V ^W = W is isometric: 



This directly leads to the following conclusion. 

Lemma A.l. For every q £ M, the following hold: 

(i) (dtfq)* o (dtpg) — Ir q M; the identity operator on T q M.; 

(ii) (dtfq) o (d<p q )* is the projection operator of T v ^ q )W to its subspace d(p q (T q M) C 



dip q (v) ■ d(p q (vi) = {v,vx) q Vu, V\ <G T q M. 
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(iii) ||d^!| = ||(d^)*|| = i. 

Lemma A. 2. Let Ai be compact. There exists Eq > uniform in q £ Ai such that exp q 
is well defined on B q (eo) and is a diffeomorphism onto its image. Moreover, given an 
orthonormal basis {e[}f =l ofT q A4, under the q-normal coordinates defined by exp" 1 , if 
\v\ < Sq the following hold: 

(i) i < \/dct(G)(v) < | where G is the volume element. 

(ii) i|^<Mcxp 9 ( v ))-^)||<M 2 . 

(iii) ip(cxp q (v)) - (p(q) = d(p q (v) + 0(\v\ 2 ). 

Proof. This follows directly from the compactness of Ai and Proposition 2.2 in [9]. See 
[24] for a self-contained proof of a very similar result. □ 

Lemma A. 3. Let Ai be compact and Eq be given as in Lemma A. 2. If f & C 2 (Ai), then 
there exists a constant C > such that for all gG Ai and v G T q A4, \v\ < eq, 

\f(cxp q (v)) - f{q) - (V M f(q),v)\ < C\v\ 2 . 
Proof. Since / £ C 2 (Ai), f o cxp q (v) is C 2 on B q (eo). By the discussion in Section 2.2, 

\f(cxp q (v))-f(q)-(V M f(q),v)\ 

= \(fo cxp q )(v) - (/ o exp g )(0) - V(/ o cx Pg )(0) • v\ 

<C q \v\ 2 

with C q = sup veBq ( eo ) |V 2 (/ o exp q )(v)\. Since V 2 (/ o cxp q )(v) is continuous in q and M. 
is compact, C = sup q&M C q exists and our conclusion follows. □ 

We remark that if M. C W is a submanifold with intrinsic metric, /rp is an isometric 
embedding. If in addition M is a closed domain in W, then exp x (v) = x + v for 

Appendix B: Proofs of convergence results 

In the manifold setting we usually do not know the manifold or the embedding. It will 
be convenient to regard M. as a submanifold of W and dj^t as the intrinsic metric on 
the manifold. This implies that (p = Imp and we can identify Ai as X = ip(Ai). Note that 
the marginal distribution p M on Ai induces a distribution p x on X = (p(Ai). The above 
notation implies p M = p x . We denote 2> x = d(I^ P ) x and S 1 is the operator on vector fields 
such that @h(x) = @ x (h(x)) for h e TAi. Correspondingly, is the dual of $ x and <3* 
maps a p-dimcnsional vector valued function / to a vector field f(x) = 2>*(f(x)). We 
will adopt this notation to simplify our expression and give the proofs of the results in 
Section 3. 

Recall the following definition given in [16]. 
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Definition B.l. Denote Z = XxY = MxY. For s > and f :M — !> W, define the 
expected error 



Z J Z 



If we denote a 2 — J z J z e W x £H 2 /( 2s2 ) (y — f r (x)) 2 dp(x, y) dp(£, rf), then 

£{f)=2al+ [ [ er^ 2 /^ 2 Hf r (x)-f r (0 + f(x)-(t-x)) 2 dp M (x)dp M (0. 



M JM 



Define 



/ A = arg min {£(/) + X\\f\\ 2 K }. 

/en- 



It can be regarded as the infinite sample limit of Jd.x- The following decomposition holds 

\\2*fD-VMfrh' <4fD-fx\\K + \\®*fx-V M fr\\Li ■ 

This bounds 11^* /u — V m/AIl 2 by two terms. The first one is called sample error and 
the second is the approximation error. 

For the sample error, we have the following estimate. 

Proposition B.l. Assume \y\ < M almost surely. There are two constants C\,C<i > 
such that for any S > 0, with confidence 1 — 8, 

- - < C 1 log(2/J) f C 2 \og(2/6) 1\ ? 

\\JD - J\\\K < p=7 h 7=7 1 /A if- 

\ \/mA m I 



This estimate has in fact been given in [16]. To sec this, we notice two facts. First, 
our algorithm in Definition 2.1 is different from that in [16] only by a scalar. Second, the 
proof in [16] does not depend on the geometric structure of the input space. So a scalar 
argument leads to the above bound for the sample error directly. Of course, one may 
obtain better estimates by incorporating the specific geometric property of the manifold. 

Next we turn to estimate the approximation error. In the following, we always assume 
M. compact and set Eq to be the same as in Lemma A. 2. Without loss of generality, we 
also assume eq < 1. 

Proposition B.2. Assume (3.1) and (3.2). If f r £ C 2 (M), there is a constant C 3 > 
such that for all A > and s < Eq, 



Wh v M f r \\i lM < c 3 ^ + + - j [s 2 + Jtr v 

If dp M = dp, the estimate can be improved. 
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Proposition B.3. Let f r £ C 2 (A4). If dp M = d/i, then there exists a constant C^^ > 
such that for all A > and s < Eq, 

Wh ~ V M fr\\h < (s + + l) (s 2 + + S 2 



X J\ \s d + 2 

The proof of these bounds for the approximation error will be given in two steps. In 
the first step we bound the i 2 -differcncc by the expected error and in the second step 
the functional is used to control the expected error. 

Lemma B.l. Assume Condition 3.2 and f r £ C 2 (A4). There exists a constant C3 > so 
that for all s < Sq and f £ , 

Wf- Vju/lli^ < c 3 ((1 + ||/1k)V + ^e(S(f) 2a 2 )) . (B.l) 
//, in addition, dp M — dp,, then the estimate can be improved to 

||^*/-Vm/||L < c ^(l + \\f\\ K ) 2 s + ^(£(f)-2a 2 )) (B.2) 

P M \ S J 

for some c%^ > 0. 

Proof. Denote X s = {x £ M :d M (x,dM) > s and v(x) > (1 + c 2 )s e }. For x £ M, let 
Bx.s = {£, £ A4: oIm(£,, x) < s}. We prove the conclusion in three steps. 
Step 1. Define the local error function 



CT S (X)= / e-ll-fll A* >{f r {x)-fr(Q + f(x)-(Z-x)yMZ)- 

We claim that there exists a constant c' > such that for each x £ X s , 



- V M fA < c'((l + \\f\\ K ) 2 s 2 + ^er s (*)). (B.3) 

Since s < Eq, exp x is a homeomorphism from B x (s) onto B XjS . For every £ £ B XiS there 
exists v £ B x (s) so that £ = exp 2 . (v). Write every v £ T X A4 in normal coordinates. Then 
er s (a;) equals 



e -||*-ex P ^)|| /(2s ){fr{x) _ f r ( expx ( v )) + f( x ) . (ex Px (v) - x)f^t(G)(v) dv. 

B„(s) 

Denote 

h(v) = f r (x) - / P (exp») + f(x) ■ (exp» - x) - ($>*f(x) - V M fr{x),v), 

t 2 (v) = (@*f(x)-V M fr(x),v). 
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By the assumption f r <E C 2 (A4) and Lemma A. 3, 

|/ r (ar) - /r(exp a (w)) + (V*i/ r (a:),v>| < ci|^| 2 
for some c\ > 0. By Lemma A.2(iii), there exists £2 > depending on M. only so that 

I/O) • (ex Px (t;)-x)-/(x)-^ B (t;)|<^||/(x)|||t;| a <Q ! «||/l| Jf H a . 
Notice that fix) ■ $> x {v) = (^*{f(x)),v) = (&*f(x),v). So we have 

|*i(v)l<(ci+C2K||/]k)M 2 - 
Using the facts \\v\ 2 < \\x — exp 2 ,(?;)|| 2 < |w| 2 and | < y / dct(G)(u) < |, we obtain 

<|(ci+c 2 ||/1k) 2 / e-M a /(^)|«|*d« 

z J|«|<a 

<^(5i+5 2 ||/|k) 2 S3.s d+4 , 

where £3 = J| t ,| <:L e~''"' 2 / 4 |u| 4 du. 
Denote 

Q a (a:)= / e - |l "- exp - ( " )l|2/(2s2) |t 2 («)| 2 v /det(G)(w)di;. 

JB x (s) 

By the Schwarz inequality 

f C -Il— oxp x W|| 2 /(2. 2 )| tl ( u )|| i2 ( v )| v / dct ( G )( u ) dv 

< (7 e-ll ffi - e ^WII 2 /( 2s2 )|i 1 (7;)| 2 A /det(G)( ? ;)d U > ) 7 v 7 ^) 

\JB x (s) / 

<(ci+ call/Ik) yJ\w*+*y/QJxj. 

Then we get 

er a (s)> / e -ll— — p^c^)|| 2 /C2 S 2 ) ( |^ 2( ^ ) |2 _ 2 |t 1 (-y)||t 2 (^)| — 2 |i x (^) | 2 ) ^/d^tCG) (^) 

> Qs{x) - {cx+M\f\\K)\j\hs d +^VQjx)- \{h + ~C2\\f\\ K ) 2 ~C 3 S d+i . 
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Qs{x) < 3(c! + c 2 \\f\\ K ) 2 c 3 s d+4 + 2er s (x). 

By the facts ||x - u\\ 2 < \v\ 2 : y/dct(G)(v) > § and f B , s) c~\ v \ 2 / { - 2s ^v l v 3 = 0, if i ^ j, 
we obtain 

Qs{x) > (®*f( X ) - VMfr(xM@*f(x) - V M fr(x)) j [ e'^'^ViVj dv 



i,J=l 

= 5 4 .s d+2 |^/>)-V >( / I .( a; )| 2 , 
where c 4 = ^ /|„i<i e~''"' 2 / 2 |w| 2 dv. Therefore, our claim (B.3) holds with 

c' = ^(3(5i + 5 2 ) 2 c 3 + 2). 

Step 2. By (B.3) wc have 

\&Rx)-V M fAx)\ 2 dp M (x) 

x s 

< J ((1 + \\f\\ K ) 2 s 2 + ^T 2 j x ^s{x) dp M (x) 



(B.4) 



By the assumption dp M (£) = v(£)dp and (3.2), we have > s e if x € X s and £ £ 
//,. ... Therefore, 

Integrating both sides over x on X s and using the fact B x>s C M. when x € X s , we obtain 

/ er s (x)dp M (x)<^(£(f)-2a 2 ). 

Plugging into (B.4) gives 
J x \9*?{x) - V^/ r (x)| 2 dp M (x) < c! ((1 + \\f\\ K ) 2 s 2 + 1 ^(£(f) 2a 2 )) . (B.5) 
If dp M = dp, we immediately obtain 

er s (x)dp_ M (x)<£(/)-2a 2 
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and hence 

^ \®*f(x) - V M fr(x)\ 2 dp M (x) < c'((l + ||/|| A -) 2 S 2 + (/) - 2<x 2 )) . (B.6) 

Step 3. Condition (3.1) implies v is continuous. Since .M is compact, sup^g^ v(x) = 65 
exists. So 

(A* \ X s ) < c 2 s + c 5 (l + c 2 )sV(M). 
Together with the fact \@*f(x) - ^ mU{x)\ < n\\f\\ K + ||VM/r||oo> we have 

\@*f(x) - V M fr(x)\ 2 dp M (x) < c 6 (l + ||/]k)V (B.7) 

M\X, 

With C 6 = (« + ||Vjw/r||oo)(c2 +C 5 (1 + C 2 )/i(-M)). 

Combining (B.5) and (B.7) leads to conclusion (B.l). 

If dp M = dp, v{x) = 1 for all x £ M, (3.1) holds with 6 = 1 and c 5 = 1. So (B.7) holds 
with 0=1. This together with (B.6) proves (B.2). 

This finishes the proof. □ 

Define the functional 

A(s,X)= mf (£(/)- 2a 2 + X\\f\\ 2 K ). 

Applying Lemma B.l to fx, we immediately obtain the following corollary. 
Corollary B.l. Under Assumption (~B.l), we have 



If dp M = dp, then 



Wh - VAf/r|li ?jM < ^ (2s + (y + pj^W, A) 

Proof. It suffices to notice that both A 1 1 /a 1 1 If and £(fx) — 2cr 2 are bounded by A(s, A). □ 

Next we estimate A(s,X). We will need the following lemma. 
Lemma B.2. Let f r £ C 2 (M). There exists c 4 > such that for f £ H P K , 
£{f) - 2a 2 < c 4 (s d+i + s d+A \\f\\ 2 K + \\9*f- V M fr\\h )■ 



Learning gradients on manifolds 205 

Proof. Since M. is a submanifold with intrinsic distance, there exists So > such that 
U~x\\ <S implies d M (x,0 <e . Denote A = {£ G M : ||£ - x\\ <S }. Then Ac<B x , eo . 
So 

£{f) - 

< [ [ c-\^\\ 2 '^ 2 \fAx)-fr{i) + f{x)-{i-x)) 2 dp M {i)Ap M {x) 
+ [ [ e-^ 2 /^ 2 \f r (x)-f r (0 + f(x)-^-x)fdp M (0dp M (x) 

J M JM\A 

:=Ji+J 2 . 
It is easy to notice that 

J 2 <5 8 e-^ 2 )(l + ll/1^) 

for some £s > 0. 

Note that for every x <E M. 1 exp~ 1 (B x . £() ) C B x (e). Write £ in cc-normal coordinates. 
Then on £> x , £o there holds ^\v\ 2 < \\x — £|| and ^/det(G)(w) < §. Let h(v), t 2 {v) and c 5 
be the same as in the proof of Lemma B.l. We have 

Ji<25 5 / / e-^--P«WllV( 2 0(|t 1 ( u )|2 + |f 2 ( u )|2) v ^^( u ) dud/3M ( :E ) 

JVW •/expJ 1 (B ai<eo ) 

<25 5 / / c-^ 2 l^ 2 \\t 1 (v)\ 2 + \t 2 (v)\ 2 )^t^(v)dvAp M {x) 

J M J \v\<So 

< C 9 ((l + \\f\\ 2 K )s d+i + S d+2 \\2>*f- V M f r \\h ) 

for some eg > 0. 

Combining the estimates for J\ and J2, we finish the proof. □ 
Lemma B.2 implies the following corollary which bounds A(s, A) by the functional Jf. 
Corollary B. 2. Iff r eC 2 {M), then 

A^)<«(^ + ^*(± + *)) 

with C5 = max(c4, 1). 

Proof. By Lemma B.2, for every / £ 1-L P K , there holds 

£(f)-2a 2 + X\\f\\ 2 K <c 5 (s d ^ + s d+2 \\^f-V M f r \\ 2 L2 + (A + s d+A )\\f\\ 2 K ). 
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The conclusion follows by taking infinium over / £ T-L P K . □ 

One easily sees that Propositions B.2 and B.3 follow from combining Corollaries B.l 
and B.2. 

Remark B.l. We remark that the approximation error estimate in Proposition B.2 
converges to as s -> if Jf (t) = 0(t p ) with /3 > §. The result may be improved by using 
functional analysis techniques; see, for example, [16]. However, it seems those techniques 
require the explicit functional expression of f\, which is available only in the regression 
case. The proof method we provide here is not as powerful for the regression case but it 
is more general and can be applied even in cases where f\ only exists implicitly, such as 
the classification setting. 

Now we prove Theorems 3.1 and 3.2. 

Proof of Theorem 3.1. First note that Jf(t) is increasing with t. Since s < 1 and A = 
s d+2+2e < lj we haye + s 2) = x(s 2B + s 2) < jf(2s 2e ). Then by Corollary B.2, 

\\\fx\\ 2 K < A(s, A) < c 5 (s d + 4 + s d+2 Jf(2 S 2e )). 
Plugging into the sample error estimate in Proposition B.l gives 

with C = C\ + (C 2 + i^a)Vc5- B Y Proposition B.2 

\\h - ^MfrWh < 3C 3 (s e + s- e je(2s 2e )). 

Combining these two estimates, we draw the conclusion with the constant C p = 
max{(C") 2 ,3C* 3 }. □ 

Note that if M. is a compact domain in W ', then d = p, @ = 2>* = I, and V ' m is the 
usual gradient operator. In this case J(T(t) = 0(t) if Vf r € T-L P K - The rate in Corollary 
3.1 becomes 0(?i _e ^ 2p+4+5e ^), which is of the same order as that derived in [16]. This 
implies that our result reduces to the Euclidean setting when the manifold is a compact 
domain in the Euclidean space. 

Proof of Theorem 3.2. By s < 1 and A = s d+3 we obtain J^{-^ps + « 2 ) = J^(s + s 2 ) < 
JT(2s). Then by Corollary B.2, 

M\h\\K < As, A) < c b {s d+i + S d + 2 .^{2s)). 
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Plugging into the sample error estimate in Proposition B.l gives 



\\fD-h\\K< 



C'\og(2/S) 



(1 + y/s-iJtTps)) 



with C = d + (C 2 + i^V^iF- By Proposition B.3, 

1 1 fx - VmM\1* < 3C 3 A S + ^(2s 2 )). 



The conclusion follows by combining the above two estimates. 



□ 
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